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Abstract 

This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) 
sparse eigenvectors with at most k non-zero components. We propose a simple yet effective 
solution called truncated power method that can approximately solve the underlying noncon- 
vex optimization problem. A strong sparse recovery result is proved for the truncated power 
method, and this theory is our key motivation for developing the new algorithm. The proposed 
method is tested on applications such as sparse principal component analysis and the dens- 
est fc-subgraph problem. Extensive experiments on several synthetic and real-world large scale 
datasets demonstrate the competitive empirical performance of our method. 

1 Introduction 

Given a p x p symmetric positive semidefinite matrix A, the largest k-sparse eigenvalue problem 
aims to maximize the quadratic form x'^ Ax with a sparse unit vector x G with no more than k 
non-zero elements: 

Amax(^, fc) = maxx~^Ax, subject to ||x|| = 1, ||x||o < k, (1.1) 



where || • || denotes the ^2-iiorm, and || • ||o denotes the io-novm which counts the number of non-zero 
entries in a vector. The sparsity is controlled by the values of k and can be viewed as a design 
parameter. In machine learning applications, e.g., principal component analysis, this problem is 
motivated from the following perturbation formulation of matrix A: 

A = A + E, (1.2) 

where A is the empirical covariance matrix, A is the true covariance matrix, and E is a random 
perturbation due to having only a finite number of empirical samples. If we assume that the largest 
eigenvector x of ^ is sparse, then a natural question is to recover x from the noisy observation A 
when the error E is "small". In this context, the problem (|l.ip is also referred to as sparse principal 
component analysis (sparse PC A). 
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In general, problem (jl.ip is non-convex. In fact, it is also NP-hard be cause it can be re- 



duce d to the subset selection problem for ordinary least squares regression (jMoghaddam et al. 



20061 ). which is known to be NP hard. Various researchers have proposed approximate optimiza- 



tion i nethods: some are based on greedy procedures (e.g., iMoghaddam et al.l . l2006l : Ijolliffe et al. 



20031 : Id'Aspremont et al.l. l2008l), and some othe r s are based on various types of conve x relaxation 



or reformulation (e.g., d'Aspremont et al. . 2007 : Zou et al. . 20061 : Journee et al. . 2O10l ). Although 
many algorithms have been proposed, almost no satisfactory theoretical results exist for this prob - 



200?') 



lerr i. The only exception i s the analysis of the convex relaxation method (jd'Aspremont et al.l 
bv lAmini fc Wainwrightl ^200^ ) under the high dimensional spiked covariance model (iJohnstone . 



2001 



). However, the result was concerned with variable selection consistency under a very simple 
and specific example with limited general applicability. 

This paper proposes a new computational procedure called truncated power iteration method 
that approximately solves (II. ip . This method is similar to the classical power method, with an 
additional truncation operation to ensure sparsity. We show that if the true matrix A has a 
sparse (or approximately sparse) dominant eigenvector x, then under appropriate assumptions, this 
algorithm can recover x when the spectral norm of sparse submatrices of the perturbation E is small. 
Moreover, this result can be proved under relative generality without restricting ourselves to the 
rather specific spiked covariance model. Therefore our analysis provides strong theoretical support 
for this new method, and this differentiates our proposal from previous studies. We have applied 
the proposed method to sparse PCA and to the densest fc-subgraph finding problem (with proper 
modification). Extensive experiments on synthetic and real- world large scale datasets demonstrate 
both the competitive sparse recovering performance and the computational efficiency of our method. 

It is worth mentioning that the truncated power method developed in this paper can also be 
applied to the smallest k-sparse eigenvalue problem given by: 



mm X 



Ax, 



subject to ||x|| = 1, ||x||o < k, 



which also has many applications in machine learning. 



1.1 Notation 

Let = {A G W^P I A = A~^} denote the set of symmetric matrices, and = {A S 
S^, A ^ 0} denote the cone of symmetric, positive semidefinite (PSD) matrices. For any A G 
SP, we denote its eigenvalues by Amin(^) = Ap(A) < ••• < Xi{A) = Amax(^)- We use p{A) 
to denote the spectral norm of A, which is max{|Amin(^)|i |Amax(^)|}) and define p{A,s) := 
max{|Amin(^) s)!, |Amax(^5 s)|}- The i-th entry of vector x is denoted by [x]i while [A]ij denotes 
the element on the i-th row and j'-th column of matrix A. We denote by A^ any k x k principal 
submatrix of A and hy Ap the principal submatrix of A with rows and columns indexed in set 
F. If necessary, we also denote Ap as the restriction of A on the rows and columns indexed in F. 
Let be the ^p-norm of a vector x. In particular, ||x||2 = V x~^x denotes the Euclidean norm, 
||x||i = X^j^i \ denotes the ^i-norm, and ||x||o = #{j : [x]j ^ 0} denotes the ^o-norm. For sim- 
plicity, we also denote the £2 norm ||x||2 by ||x||. In the rest of the paper, we define Q{x) := x'^ Ax 
and let x^ be the optimal solution of problem (jl.ip . We let supp(x) := {j :| [x]j / 0} denote the 
support set of the vector x. Given an index set F, we define 

x{F) := argmaxx"^ Ax, subject to = 1, supp(x) C F. 

x&RP 

Finally, we denote by Jpxp the p x p identity matrix. 
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1.2 Paper Organization 



The remaining of this paper is organized as follows: Section [2] describes the truncated power 
iteration algorithm that approximately solves problem (jl.ip . In Section [3] we analyze the solution 
quality of the proposed algorithm. Section H] evaluates the practical performance of the proposed 
algorithm in applications of sparse PCA and the densest fc-subgraph finding problems. We conclude 
this work and discuss potential extensions in Section O 

2 Truncated Power Method 

Since Amax(^i k) equals Amax(^fc) where is the k x k principal submatrix of A with the largest 
eigenvalue, one may solve (jl.ip by exhaustively enumerate all subsets of {1, . . . ,p} of size k in order 
to find A'^. However, this procedure is impractical even for moderate sized k since the number of 
subsets is exponential in k. 

Therefore in order to solve the spare eigenvalue problem (jl.ip more efficiently, we consider an 
iterative procedure based on the standard power method for eigenvalue problems, while maintain- 
ing the desired sparsity for the intermediate solutions. The procedure, presented in Algorithm [21 
generates a sequence of intermediate /c-sparse eigenvectors xo,a;i, . . . from an initial sparse approx- 
imation xq. At each step t, the intermediate vector xt-i is multiplied by A, and then the entries 
are truncated to zeros except for the largest k entries. The resulting vector is then normalized to 
unit length, which becomes xt- It will be assumed throughout the paper that the cardinality k of 
supp(a;=K) is available a prior; in practice this quantity may be regarded as a tuning parameter of 
the algorithm. 

Definition 1. Given a vector x and an index set F , we define the truncation operation Truncate{x, F) 
to be the vector obtained by restricting x to F, that is 



Parameters : cardinality k G {l,...,p} 

Let t = 1. 
repeat 

Compute x[ = Axt-i /\\Axt-i\\. 

Let Ft = supp(x'j, k) be the indices of x[ with the largest k absolute values. 
Compute Xt = Truncate(xj, Fj). 
Normalize xt = xt/\\xt\\- 
t^t + l. 
until Convergence; 



Remark 1. Similar to the behavior of traditional power method, if A £ S^, then TPower tries to 
find the (sparse) eigenvector of A corresponding to the largest eigenvalue. Otherwise, it may find 




Algorithm 1: Truncated Power (TPower) Method 



Input : matrix ^ G S^, initial vector xq E 

Output : Xt 
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the (sparse) eigenvector with the smallest eigenvalue if —Xp{A) > Xi(A). However, this situation is 
easily detectable because it can only happen when Xp{A) < 0. In such case, we may restart TPower 
with A replaced by an appropriately shifted version A + XIpxp- 



3 Sparse Recovery Analysis 

We consider the general noisy matrix model (11. 2p . and are specially interested in the high dimen- 
sional situation where the dimension p of ^ is large. We assume that the noise matrix ii^ is a dense 
p X p matrix such that its sparse submatrices have small spectral norm p{E, s) for s in the same 
order of k. We refer to this quantity as restricted perturbation error. However, the spectral norm 
of the full matrix perturbation error p{E) can be large. For example, if the original covariance is 
corrupted by an additive standard Gaussian iid noise vector, then p{E,s) = s logp/n), which 
grows linearly in ^/s, instead of p{E) = 0{^Jp/n), which grows linearly in yfp. The main advantage 
of the sparse eigenvalue formulation (II. ip over the standard eigenvalue formulation is that the es- 
timation error of its optimal solution depends on p{E, s) with respectively a small s = 0{k) rather 
than p{E). This linear dependency on sparsity k instead of the original dimension p is analo gous 
to similar results for sparse regression (or compressive sensing) such as (ICandes k Tad , boosl ^. In 



fact the restricted perturbatio n error considered her e is analogous to the idea of restricted isometry 
property (RIP) considered in (jCandes Taol . boosl l. 



The purpose of the section is to show that if matrix A has a unique sparse (or approximately 
sparse) dominant eigenvector, then under suitable conditions, TPower can (approximately) recover 
this eigenvector from the noisy observation A. 

Assumption 1. Assume that the largest eigenvalue of A £ Ef is X = Amax(^) > that is non- 
degenerate, with a gap AX = A — maxj>i |Aj(A)| between the largest and the remaining eigenvalues. 
Moreover, assume that the eigenvector x corresponding to the dominant eigenvalue X is sparse with 
cardinality k = \\x\\q. 

We want to show that under Assumption [H if the spectral norm p{E, s) of the error matrix 
is small for an appropriately chosen s > k, then it is possible to approximately recover x. Note 
that in the extreme case ol s = p, this result follows directly from the standard eigenperturbation 
analysis (which does not require Assumption [1]) . 

We now state our main result as below, which shows that under appropriate conditions, the 
TPower method can recover the sparse eigenvector. The final error bound is a direct generalization 
of standard matrix perturbation result that depends on the full matrix perturbation error p{E). 
Here this quantity is replaced by the restricted perturbation error p{E, s). 

Theorem 1. We assume that Assumption\^ holds. Let s = 2k + k with k > Ak. Assume that 
p{E,s) < AX/2. Define 

X-AX + p{E^ V2piE,s) 
lis) ■■= r 7^r-s < 1, d{s) — 



^-piE,s) ' ^p{E,sY + {AX-2p{E,s)Y' 

If |xq x| >u + 6{s) for some \\xq\\q < k, ||xo|| = 1, and u G [0, 1] such that 

p, =(1 - j{sf)u{l - u^)/2 - {25{s) + (fcA)i/2) G (0, 1), 
P2 =^(1 + 3(fcA)V2)(i _ 0.45(1 - 7(^)2)) < 1, 
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then let /x = max(-^l — ^1,^2), we have 



1 - \xjx\ < [i^J 1 - \xlx\ + \/5(^(s)/(l - 1x2)- 



Remark 2. We only state our result with a relatively simple but easy to understand quantity 
p(E,s), which we refe r to as restricted perturbation error. It is analogous to the RIP concept in 



'Candes & Tad . \200A ). and is also directly comparable to the traditional full matrix perturbation 



error p{E) . While it is possible to obtain sharper results with additional quantities, we intentionally 
the theorem simple so that its consequence is relatively easy to interpret. 



Remark 3. Although we state the result by assuming that the dominant eigenvector x is sparse, 
the theorem can also be applied to certain situations that x is only approximately sparse. In such 
case, we simply let x' be a k sparse approximation of x. If x' — x is sufficiently small, then x' is 
the dominant eigenvector of a symmetric matrix A' that is close to A; hence the theorem can be 
applied with the decomposition A = A' + E' where E' = E + A — A' . 

Note that we did not make any attempt to optimize the constants in Theorem [H which are 
relatively large. Therefore in the discussion, we shall ignore the constants, and focus on the main 
message of Theorem [TJ If p(E,s) is smaller than the eigen-gap AA/2 > 0, then 7(5) < 1 and 
6{s) = 0{p{E, s)). It follows that under appropriate conditions, as long as we can find an initial 
xq such that 

\x^x\ >c{p{E,s) + {k/kf'^) 
for some constant c, then 1 — \xj x\ converges geometrically until 

\\xt-xf = 0{p{E,sf). 

This result is similar to the standard eigenvector perturbation result stated in Lemma [2] of Ap- 
pendix [Aj except that we replace the spectral error p{E,p) of the full matrix by p{E,s) that can 
be significantly smaller when s <C p. To our knowledge, this is the first sparse recovery result 
for the sparse eigenvalue problem in a relatively general setting. This theorem can be considered 
as a strong theoretical justification of the proposed TPower algorithm that distinguishes it from 
earlier algorithms without theoretical guarantees. Specifically, the replacement of the full matrix 
perturbation error p{Ef with p{E,sf gives the theoretical insights on why TPower works well in 
practice. 

To illustrate our result, we brief ly describe a consequence of the theorem under the s piked 
covariance model of (Johnstone, 2001 ) which was investigated by Amini Sz Wainwright ( 20091 ). We 



assume that the observations are p dimensional vectors 

Xi — X ~\~ C, 

for i = 1, . . . , n, where e ~ N{0, Ipxp)- For simplicity, we assume that ||x|| = 1. The true covariance 
is 



and A is the empirical covariance 



A — XX + Ipxp-i 



1 " 

i=l 
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Let E = A — A, then random matrix theory imphes that with large probability, 



p{E,s) = 0{\/ s Inp/n). 

Now assume that maxj \xj\ is sufficiently large. In this case, we can run TPower with a starting 
point Xq = Cj for some vector ej (where ej is the vector of zeros except the j-th entry being 
one) so that \ej x\ = \xj\ is sufficiently large, and the assumption for the initial vector |xqx| > 
c{p{E,s) + (fc/fe)V2) is satisfied with s = 0{k). We may run TPower with an appropriate initial 
vector to obtain an approximate solution xt of error 

\\xt — = 0{klnp/n). 



This i s optimal. Note that our results are not directly comparable to those of lAmini &: Wainwright 



(|2009l ). which studied support recovery. Nevertheless, it is worth noting that if maxj \xj\ is suffi' 



ciently large, then our result becomes meaningful when n = 0{k In p); however their result requires 
n = 0{k^ Inp) to be meaningful, although this is for the pessimistic case of x having equal nonzero 
values of l/Vfc. 

Finally we note that if we cannot find a large initial value with Ixg'sl, then it may be necessary 
to take a relatively large k so that the requirement l^^xl > c((fc//c)^/^) is satisfied. With such a 
k, p{E, s) may be relatively large and hence the theorem indicates that xt may not converge to 
X accurately. Nevertheless, as long as \xj x\ converges to a value that is not too small (e.g., can 
be much larger than [xg x|), we may reduce k and rerun the algorithm with xt as initial vector 
together with a small k. In this two stage process, the vector found from the first stage (with large 
k) is used to as the initial value of the second stage (with small k). Therefore we may also regard 
it as an initialization method to TPower. In practice, one may use other methods to obtain an 
approximate xq to initialize TPower, not necessarily restricted to running TPower with larger k. 
Some practical alternatives are discussed in Section [H 



4 Applications 



In this section, we illustrate the effectiveness of TPower method when applied to sparse principal 
component analysis (sparse PCA) (in Section [4. ip and the densest /c-subgraph (DkS) finding prob- 
lem (in Section l4.2p . The Matlab code for reproducing the experimental results reported in this 



section is online available at https: //sites. google. com/site/xtyuanlQSO/publications 



4.1 Sparse PCA 

Principal component analysis (PCA) is a well established tool for dimensionality reduction and 
has a wide range of applications in science and engineering where high dimensional datasets are 
encountered. Sparse principal component analysis (sparse PCA) is an extension of PCA that 
aims at finding sparse vectors (loading vectors) capturing the maximum amount of variance in 
the data. In recent years, various researchers have proposed various approaches to directly ad- 
dress t he conflicting g oals of explaining variance and achieving sparsity in sparse PCA. For in- 
stance, IZou et al.l (12006,'l formu l ated sparse PCA as a regression-type optimization problem and 



imposed the Lasso ( Tibshiraiii . 19961 ) penalty on the regression coefficients. The DSPCA algo- 



rithm in ( d'Aspremont et al. . 20071 ) is an ^i-norm based semidefinite relaxation for sparse PCA 



6 



Shen &: Huang resorted to the singular value decomposition (SVD) to compute low-rank ma- 



trix approximations of the data matrix under various sparsity- inducing penalt ies. Greedy search 
and branch-and-bound methods were investigated in (jMoghaddam et alJ . liooi ) to solve small in- 
stances of sparse PC A exact l y and to obtain approximate solutions for larger scale problems. More 



recently, Id'Asprernont et al.l (l200q') pro posed the use of greedy forward selection with a certificate 



of optimality, and Journee et al. ([2"oid ) studied a generalized power method to solve sparse PC A 
with a certain dual reformulation of the problem. In comparison, our met hod works directly in the 



primal domain, and the resulting algorithm is quite different from that of Journee et al. ( 20ld ) 



Given a sample covariance matrix, S E §^ (or equivalently a centered data matrix D G W^^^ 
with n rows o f p-dimensional observations vectors such tha t E = D]_D) and the target cardinality 
k, following ( Moghaddam et al. . 2006 : d'Aspremont et al. . 2007 . 20081 ). we formulate sparse PC A 
as: 



arg max x 



subject to 



1, ll^^llo ^ k. 



(4.1) 



The TPower method proposed in this paper can be directly applied to solve the above problem. 
One advantage of TPower for Sparse PGA is that it directly addresses the constraint on cardinality 
k. To find the top m rather than the top one sparse loading vectors, a com mon approach in the 
literature ( d'Aspremont et al. . 2007 : Moghaddam et al. . 2006 : Mackey . 20081 ) is to use the iterative 
deflation method for PGA: subsequent sparse loading vectors can be obtained by recursively re- 
moving the contribution of the previously fou nd loading vect ors from the covariance matrix. Here 
we employ a projection deflation scheme from (Mackey, 20081 ). which deflates an vector x using the 
formula: 



XX 



{Ipxp XX )Xl(/pxp 

Obviously, E' remains positive semidefinite. Moreover, S' is rendered left and right orthogonal to 



X. 



4.1.1 Connection with Existing Sparse PCA Methods 

In the setup of sparse PGA , TPo wer is closely related to GPower ( Journee et al. . 20 id ) and 
sPCA-rSVD ( Shen Sz Huang . 20081 ) which are both power-truncation-type iterative algorithms. 
Indeed, GPower and sPCA -rSVD are identical except for the initialization and post-processing 
Phases [Tourneeetal.l .B. Given a data matrix D G M"^p, the ^i-norm version of GPower (and 
equivalently sPGA-rSVD) solves the following regularized rank-1 approximation problem: 



mm 



\D 



zx 



T||2 



F + Tiklli) subject to ll^ll = 1, 



while the ^g-norm version of GPower (and sPCA-rSVD) solves the following optimization problem: 



mm 



\D 



zx 



T||2 



+ 7||x||o, subject to ||2;|| = 1. 



Given the covariance matrix S = D D, it is easy to verify that TPower optimizes the following 
constrained low-1 and semidefinite approximation problem 



min IIS 



xx'^W'j^, subject to 



Ij ||3;||o < k. 



Essentially, TPower, GPower and sPCA-rSVD all use certain power-truncation type procedure to 
generate sparse loadings. However, the difference between TPower and GPower (sPGA-rSVD) is 
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also clear: the former performs rank-1, semidefinite and sparse approximation to covariance matrix 
while the latter performs rank-1 and sparse approximation to the data matrix. One important 
benefit of TPower is that we are able to analyze solution quality such as sparse recovery capability, 
while analogous results are not available for GPower and sPCA-rSVD. 

Our method is also related to PathSPCA ( d'Aspremont et al. . 20081 ) that directly addresses the 



formulation (jl.ip . The PathSPCA method is a greedy forward selection procedure which starts 
from the empty set and at each iteration it selects the most relevant variable and adds it to the 
current variable set; it then re-estimates the leading eigenvector on the augmented variable set. 
Both TPower and PathSPCA output sparse solutions with exact cardinality k. 

4.1.2 On Initialization 

Theorem [1] suggests that the TPower algorithm can benefit from a good initial vector xq. In a 
practical implementation, the following three initialization schemes can be considered. 

1: One simple method is to set [xo]j = 1 on index j = argmaxj[yl]jj and otherwise. This 
initialization provides a l/A;-approximation to the optimal value, i.e., Q{xq) > Xuiax{A,k)/k. 
Indeed, if we let A^. be the k x k principle submatrix of A supported on supp(x=k), then 
it is easy to verify that [A]jj > Tr(^^)/A; > Aniax(^) ^)/^- In the setup of sparse PCA, 
this corresponds to initializing by selecti ng the variable with the largest variance, which 
is known to perform well for PathSPCA ( d'Aspremont et al.l . 20081 ). Alternatively, we may 



ini tialize xn as the indica t or vec tor of the top k values of the variances as is considered 



by lAmini &: Wainwright (|2009l ). 



2: A two-stage warm-start strategy suggested at the end of Section [3l In the first stage we may 
run TPower with a relatively large k and use the output as the initial value of the next stage 
with a decreased k. Repeat this procedure if necessary until the desired cardinality is reached. 
More generally, one may use other algorithms to warm start TPower. 

3: When fc ~ p, an initialization scheme suggested in (jMoghaddam et al.l . lioO^ ) can be employed. 



This scheme is motivated from the following observation: among all the m possible (m — 1) x 
(m — 1) principal submatrices of A^, obtained by deleting the j-th. row and column, there 
is at least one s ubmatrix Am-^ = ^m\ j whose maximal eigenvalue is a major fraction of its 



parent (see, e.g.- lHorn &: Johnsonl . Il99ll ): 

fn — 1 

3j £ {1, ...,m}, Amax(^m\j) > — :— Amax(^m). (4.2) 



m 



A gre edy backward elimination method is suggested using the above bound (jMoghaddam et al 



20061 ): start with the full index set {1, and sequentially delete the variable j which yields 



the maximum Amax(^m\j) until only k elements remain. It is immediate from the bound (14. 2p 
that this procedure will guarantee a /c/p-approximation to the optimal objective value. This 
scheme works well for relatively small p. When p is large, however, such a greedy initializa- 
tion scheme will be computationally prohibitive since it involves {p — k)p times of dominant 
eigenvalue calculation for matrices of scale 0{p x p). 

4.1.3 Results on Toy Dataset 

To illustrate the sparse recovering performance of TPower, we apply the algorithm to a synthetic 



dataset drawn from a sparse PCA model. We follow the same procedure proposed by (IShen &: Huang 
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20081 ) to generate random data with a covariance matrix having sparse eigenvectors. To this end, 



a covariance matrix is first synthesized through the eigenvalue decomposition S = VDV , where 
the first m columns of V £ M^^^ are pre-specified sparse orthonormal vectors. A data matrix 
X G W^^P is then generated by drawing n samples from a zero-mean normal distribution with 
covariance matrix S, that is X ~ A/'(0,S). The empirical covariance S matrix is then estimated 
from data X as the input for TPower. 

Consider a setup with p = 500, n = 50, and the first m = 2 dominant eigenvectors of S are 
sparse. Here the first two dominant eigenvectors are specified as follows: 

I 0, otherwise I 0, otherwise 

The remaining eigenvectors vj for j > 3 are chosen arbitrarily, and the eigenvalues are fixed at the 
following values: 

Ai = 400, 
A2 = 300, 

Xj = l, j = 3,...,500. 

We generate 500 data matrices and employ the TPower to compute two unit-norm sparse load- 
ing vectors ui,U2 G M^'^'^, which are hopefully close to and V2- Our method is compared on 



this dataset with a greedy algorithm PathPCA ( d'Aspremont et al. . 20081 ). a p o wer-iter ation-typ e 



method GPower ( Journee et al. . 20ld ). a sparse regression based method SPCA dZou et al.1 . boOfil ) 



and the standard PCA. For GPower, we test its two block versions GPower^j^^ and GPower^^^^ 
with £i-norm and £o-norm penalties, respectively. H e re we do not involve two r epresentative sparse 
PCA algorithms, the sPCA-rSVD (|Shen fc Huand . I2OO8I I and the DSPCA dd'Asoremont et al. 



20071 ). in our comparison sin ce the former is show n to be identical to GPower up to initialization 



and post-processing phases ( Journee et al. . 20ld ). while the latter is considered by the authors 



only as a second choice after PathSPCA. All tested algorithms were implemented in Matlab 7.12 
running on a commodity desktop. 

In this experiment, we regard the true model to be successfully recovered when both quantities 
jf^iiil and |vJm2| are greater than 0.99. We also assume that the cardinality A; = 10 of the 
underlying sparse eigenvectors is known a prior. Table ITT] lists the recovering results by the tested 
methods. It can be observed that TPower, PathPCA and GPower all successfully recover the ground 
truth sparse PC vectors with extremely high rate of success. SPCA frequently fails to recover the 
spares loadings on this dataset. The potential reason is that SPCA is initialized with the ordinary 
principal components which in many random data matrices are far away from the truth sparse 
solution. Traditional PCA always fails to recover the sparse PC loadings on this dataset. The 
success of TPower and the failure of traditional PCA can be well explained by our sparse recovery 
result in Theorem [1] (for TPower) in comparison to the traditional eigenvector perturbation theory 
in Lemma [2] (for traditional PCA), which we have already discussed in Section [3l However, the 
success of other methods suggests that it might be possible to prove sparse recovery results similar 
to Theorem [1] for some of these alternative algorithms. 

4.1.4 Speed and Scaling Test 

To study the computational efficiency of TPower, we list in Table 14.21 the CPU running time (in 
seconds) by TPower on several datasets at different scales. The datasets are generalized as n x p 
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Table 4.1: The quantitative results on a synthetic dataset. The values |M]^U2|, |wj^mi|, 1^2^*2! are 
mean of the 500 running. 



Algorithms 


Parameter 




\vl Ul\ 


^2 U2 


Probability of success 


TPower 


k 


= 10 





0.9998 


0.9997 


1 


PathSPCA 


k 


= 10 





0.9998 


0.9997 


1 


GPower^j^m 


7 


= 0.8 





0.9997 


0.9996 


0.99 


GPower£„,m 


7 


= 0.8 


9.5 X 10"^ 


0.9997 


0.9991 


0.99 


SPCA 


Ai 


= 10"^ 


2.4 X 10"^ 


0.9274 


0.9250 


0.25 


PCA 









0.9146 


0.9086 






Gaussian random matrices with fixed n = 500, and exponentially increasing values of dimension p. 
We set the termination criteria for TPower to be ||(5(xt) — (5(xt_i)|| < 10~^. It can be observed 
from Table 14.21 that TPower can exit within seconds or tens of seconds on all the datasets under a 
wider range of cardinality k. 



Table 4.2: Average computational time for extracting one sparse component (in seconds) by TPower 
under different setting of dimensionality p and cardinality k. 





n X p 


500 X 1000 


500 X 2000 


500 X 4000 


500 X 8000 


500 X 16000 


500 X 32000 


k 


= 0.05 X p 


0.07 


0.13 


0.39 


0.88 


2.52 


7.03 


k 


= 0.1 X p 


0.07 


0.14 


0.55 


1.00 


3.31 


20.05 


k 


= 0.2xp 


0.13 


0.21 


0.96 


2.00 


7.94 


21.78 


k 


= 0.5 X p 


0.12 


0.43 


2.06 


5.08 


19.63 


25.19 



4.1.5 Results on PitProps Data 

The Pitprops dataset (|jeffersl . E967l ). which consists of 180 observations with 13 measured variables 
has been a standard benchmark to evaluat e algorithms for sparse PCA (See, e.g 



Zou et al.1 . 12OO6I : 



Shen fc Hua^ . I2OO8I : Ijournee et all . l20ld ). Following these previous studies, we also consider to 
compute the first six sparse PCs of the data. In Table 14.31 we list the total cardinality and the pro- 
portion of a djusted variance jZou et al.l. l2006l ) ex plained by six compo nents comp uted with TPower , 
PathSPCA dd'Aspremont et al. . 2008, ). GPower (j.Tournee et al.l . l20inl ) and SPCA dZou et ahLliooi ). 
From these results we can see that on this relatively simple dataset, TPower, PathSPCA and 
GPower perform quite similarly. SPCA is inferior to the other three algorithms. 

Table 1331 lists the six extracted PCs by TPower with cardinality setting 7-2-1-1-1-1. We can see 
that the important variables associated with the six principal components do not overlap, which 
leads to a clear interpretation of the extracted components. The same loadings are extracted by 
both PathSPCA and GPower under the parameters listed in Table 14.31 



4.1.6 Results on Biological Data 

We have also eval uated the performa nce of TPower on two gene expression d atasets, one is the Colo n 
cancer data from ( Alon et al. . 19991 ). the other is the Lymphoma data from ( Alizadeh et al. . 2000l ). 
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Table 4.3: The quan titative results on the PitProps dataset. The result of SPCA is taken 



from (|Zou et all , hood ). 



Method 




Tofal rarnirialil'v 


Pvo'n of pynminpn variaripp 

A. ± \J tJ t KJl. CVV J.CIiJ.lJ.V^V-1. V CliJ. J.CIiJ.J.V-^Vj 


TPower 


(^QT-rlinalif ioc' 8 8 /19 9 9 

caruinaiiT^ies. o-o-4t-z-z-z 


9fi 
ZD 


U.oOoD 


TPower 


cardinalities: 7-2-3-1-1-1 


15 


0.8230 


TPower 


cardinalities: 7-2-1-1-1-1 


13 


0.7599 


PathSPCA 


cardinalities: 8-8-4-2-2-2 


26 


0.8615 


PathSPCA 


cardinalities: 7-2-3-1-1-1 


15 


0.8230 


PathSPCA 


cardinalities: 7-2-1-1-1-1 


13 


0.7599 


GPower^^^m 


7 = 0.22 


26 


0.8438 


GPower^^^m 


7 = 0.30 


15 


0.8230 


GPower^^^m 


7 = 0.40 


13 


0.7599 


SPCA 


see fZou et al. 2006) 


18 


0.7580 



Table 4.4: The extracted six PCs on PitProps dataset by TPower with cardinality setting 7-2-1-1- 
1-1. Note that in this setting, the extracted six PCs are orthogonal and the significant loadings are 
non-overlapping. 



PCs 


topd 


X2 
length 


X3 
moist 


X4 
testsg 


X5 
ovensg 


X6 
ringt 


X7 
ringb 


xs 
bowm 


xg 
bowd 


xio 
whorls 


Xll 

clear 


X12 
knots 


X13 
diaknot 


PCI 


0.4235 


0.4302 











0.2680 


0.4032 


0.3134 


0.3787 


0.3994 











PC2 








0.7071 


0.7071 





























PCS 














1.000 


























PC4 
































1.000 








PCS 


































1.000 





PC6 






































1.000 



Following the experimental setup in ( d'Aspremont et al. . 20081 ) . we consider the 500 genes with the 
largest variances. We plot the variance versus cardinality tradeoff curves in Figure 14. H together 



with the result from PathSPCA (jd'Aspremont et al.l . |2008| ) and the upper bounds of optimal values 



from ( d'Aspremont et al. . 20081 ). Note that our method performs almost identical to the PathSPCA 
which is demonstrated to have optimal or very close to optimal solutions in many cardinalities. The 
computational time of the two methods on both datasets is comparable and is less than two seconds. 



4.1.7 Results on Document Data 

In this section we evaluate the practical performance of TPower for key terms extraction on a 
document dataset 20 News groups (20NG). The 2ONG0 is a dataset collected and originally used for 
document classification by Lang] ( 19951 ). A total number of 18,846 documents, evenly distributed 
across 20 classes, are left after removing duplicates and newsgroup-identifying headers. This corpus 
contains 26, 214 distinct terms after stemming and stop word removal. Each document is then 
represented as a term- frequency vector and normalized to one. We use the top 1,000 terms according 
to the DF (document frequency) of the terms in the corpus. We extract 5 sparse PCs on this dataset. 
The cardinality setting for the 5 sparse PCs is 20-20-10-10-10. Table H3] lists the terms associated 



^http: //people . csail .mit . edu/jremiie/20Newsgroups/| 
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Sparse PCA ^ Sparse PCA 




Cardinality Cardinality 

(a) Colon Cancer (b) Lymphoma 

Figure 4.1: The synthetic dataset and outlier removal result. For better viewing, please see the 
original pdf file. 



with the 1st, 2nd and 5th sparse PCs. The interpretation is quite clear: the 1st sparse PC is about 
figures, the 2nd is about computer science, and the 5th is on religion. We have observed that quite 
similar terms are extracted by PathSPCA under the same cardinality setting. Here we do not list 
the 3rd and 4th PCs since they overlap with the listed ones due to the non-orthogonality of sparse 
PCs. 



4.1.8 Summary 

To summarize this group of experiments on sparse PCA, the basic finding is that TPower per- 
forms quite competitively in terms of the trade-off between ex plained variance and repr esen- 



tation sp arsity. The perform ance is comparable to PathSPCA (jd'Aspremont et al.l . l2008l ) and 



GPower ( Jour nee et al. . 20ld ) both on the synthetic a nd on the real da tasets. It is observed that 



TPower, PathSPCA and GPower outperform SPCA (|Zou et al.l . 120061 ^ on the benchmark data 



Pitprops. Although performing quite similarly, TPower, PathSPCA and GPower are different algo- 
rithms: TPower is a power iteration method while PathSPCA is a greedy forward selection method, 
both directly address the cardinality constrained sparse eigenvalue problem (jl.ip . while GPower is 
a power iteration method for certain regularized versions of sparse eigenvalue problem (see the pre- 
vious Section r4.1.ip . While strong theoretical guarantee can be established for the TPower method, 
it remains open to show that PathSPCA and GPower have a similar sparse recovery performance. 



4.2 Densest /c-Subgraph Finding 

As another concrete application, we show that with proper modification, TPower can be applied 
to the densest /c-subgraph finding problem. Given an undirected graph G = {y,E), \V\ = n, 
and integer 1 < k < n, the densest /c-subgraph (DkS) problem is to find a set of k vertices with 
maximum average degree in the subgraph induced by this set. In the weighted version of DkS we are 
also given nonnegative weights on the edges and the goal is to find a A;- vertex induced subgraph of 
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Table 4.5: Terms associated with the first 1st, 2nd, 5th sparse PCs. Dataset: 20NG. 
" 1st PC (20 terms) 2nd PC (20 terms) 5th PC (10 terms) 





"edu" 


"dont" 


"2" 


"system" 


"time" 


"3" 


"inform" 


"people" 


"4" 


"includ" 


"believ" 


"5" 


"support" 


"god" 


"10" 


"program" 


"exist" 


"20" 


"version" 


"christan' 


"15" 


"set" 


"jesus" 


"8" 


"window" 


"christ" 


"6" 


"softwar" 


"atheist" 


"16" 


"avail" 




"12" 


"file" 




"14" 


"data" 






"user" 




"18" 


"grafic" 




"9" 


"color" 




"13" 


"imag" 




"11" 


"displai" 




"0" 


"format" 




"la" 


"ftp" 





maximum average edge weight. Algorithms for finding DkS are usef ul tools for analyzing networks. 
In particular, they h ave been used to sel ect features for ranking (IGeng et al .. 2007), to identify 



cores of communities ( Kumar et al. . 19991 ). and to combat link spam ( Gibson et al. I, |2005|). 



It has been shown that the DkS problem is NP har d for bipartite gra phs and chordal graphs (ICorneil Sz Peril . 



1984), and even for graphs of maximum degree three ( Feige et al. . 200ll ). A large body of algorithms 



have been proposed based on a variety of techniques includi ng greedy algorithms (IFeige et al.l. 12001 



Asahiro et al.l . l2002l : lRavi et al.l . ll994l ). linear programming (IBillionnet Roupin 



2004 



Khuller fc Saha 



20091 ). and semidefinite pr ogramming (ISrivastav &: Wola . 119981 : lYe &: Zhangi . |2003| ) . For general k, 
the algori thm developed bvlFeige et al.l (|200ll ) achieves the best approximation ratio of 0{n'') where 
e < 1/3. iRavi et al.l (|l994l ) proposed 4-approximation algorit hms for weighted DkS on complete 



graphs for which the weights satisfy the triangle inequality. iLiazi et al.l (120081) has p resented a 
3-approximation algorithm for DkS for chordal graphs. Recently, Jiang et al. ( 20ld ) proposed 
to reformulate DkS as a 1-mean clustering problem and developed a 2- approximatio n to the re- 
formulated clustering problem. Moreover, based on this reformulation, lYan 

3 (!20ld ) proposed a 

1 + e-approximation algori thm w i th ce rtain exhaustive (and thus expensive) initialization proce- 
dure. In general, however, Khot ( 20061 ) showed that DkS has no polynomial time approximation 
scheme (PTAS), assuming that there are no sub-exponential time algorithms for problems in NP. 
Mathematically, DkS can be restated as the following binary quadratic programming problem: 



max 7r~''l^7r, subject to vr G {1,0}", ||7r| 



(4.3) 
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where W is the (non-negative weighted) adjacency matrix of G. If G is an undirected graph, then 
W is symmetric. If G is directed, then A could be asymmetric. In this latter case, from the fact 
that tt^Wti = Ti^ w+w may equivalently solve Problem (j4.3p by replacing W with , 

Therefore, in the following discussion, we always assume that the affinity matrix W is symmetric 
(or G is undirected). 



4.2.1 The TPower-DkS Algorithm 

We propose the TPower-DkS algorithm as a slight modification of TPower, to solve the DkS prob- 
lem. The process generates a sequence of intermediate vectors ttq, tti, ... from a starting vector ttq. 
At each step t the vector vr^-i is multiplied by the matrix W, then is set to be the indicator 
vector of the top k entries in Wnt-i- The TPower-Dks is formally given in Algorithm [2l 



Algorithm 2: Truncated Power Method for DkS (TPower-DkS) 
Input :We§1„ initial vector ttq G M" 

Output : TTt 

Parameters : cardinality k G {l,...,n} 

Let t = 1. 
repeat 

Compute vr^ = WiTt^i. 

Identify Ft = supp(7rj, k) the index set of tt^ with top k values. 
Set -Kt to be 1 on the index set Ft, and otherwise. 
t^t + l. 
until Convergence; 



Remark 4. By relaxing the constraint vr G {0, 1}" to ||7r|| = ^/k, we may convert the densest 
k-subgraph problem (j4.3p to the standard sparse eigenvalue problem (jl.ip (up to a scaling) and then 
directly apply TPower (in Algorithm 1) for solution. Our numerical experience shows that such a 
relaxation strategy also works satisfactory in practice, although is slightly inferior to TPower-DkS 
(in Algorithm 2) which directly addresses the original problem. 

Note that in Algorithm [2] we require that W is positive semidefinite. The motivation of this 
requirement is to guarant ee the convexi t y of t he objective in problem ()4.3p . and thus following 
the similar arguments in ( Journee et al. . 20ld ) it can be shown that the objective value will be 



monotonically increasing during the iterations. In many real-world DkS problems, however, it is 
often the case that the affinity matrix W is not positive semi-definite. In this case, the objective is 
non-convex and thus the monotonicity of TPower-DkS does not hold. However, this complication 
can be circumvented by instead running the algorithm with the shifted quadratic function: 

max TT^ (W + XIr,xv)'^i subject to vr G |0, ||7r||o = k. 

where A > is large enough such that W = W + XIpxp G . On the domain of interest, this change 
only adds a constant term to the objective function. The TPower-DkS, however, produces a different 
sequence of iterates, and there is a clear trade-off. If the second term dominates the first term (say 
by choosing a very large A), the objective function becomes approximately a squared norm, and 
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the algorithm tends to terminate in very few iterations. In the Umiting case of A — )• oo, the method 
will not move away from the initial iterate. To handle this issue, we propose to gradually increase A 
during the iterations and we do so only when the monotonicity is violated. To be precise, if at a time 
instance t, TrJWnt < TTj_iW7rt-i, then we add XIpxp to W with a gradually increased A by repeating 
the current iteration with the updated matrix until ttJ (W + XIpxp 
which implies TrJWTTt > TTj_iW7rt-i. 



4.2.2 On Initialization 



Since TPower-DkS is a monotonically increasing procedure, it guarantees to improv e the initial 



point ttq. Basical l y, any existing approximation DkS method, e.g., greedy algorithms (jFeige et al. 



2OOII : iRavi et al.l . Il994l ). can be used to initialize TPower-DkS. In our numerical experiments, we 



observe that by simply setting ttq as the indicator vector of the vertices with the top k (weighted) 
degrees, our method can achieve very competitive results on all the real- world datasets we have 
tested on. 



4.2.3 Results on Web Graphs 

We have tested TPower on four page-level web graphs: cnr-2000, amazon-2008, ljournal-2008, 
hollywood-2009, from the WebGraph framework provided by the Laboratory for Web Algorithms El- 
We treated each directed arc as an undirected edge. Table 14.61 lists the statistics of the datasets 
used in the experiment. 



Table 4.6: The statistics of the web graph datasets. 



Graph 


Nodes {\V\) 


Total Arcs {\E\) 


Average Degree 


cnr-2000 


325,557 


3,216,152 


9.88 


amazon-2008 


735,323 


5,158,388 


7.02 


ljournal-2008 


5,363,260 


79,023,142 


14.73 


hollywood-2009 


1,139,905 


113,891,327 


99.91 



We compare our TPower- DkS method with two greedy methods for the DkS problem. One 



greedy method is proposed bv lRavi et al.l ()1994l ) which is referred to as Greedy-Ravi in our experi- 
ments. The Greedy-Ravi algorithm works as follows: it starts from a heaviest edge and repeatedly 
adds a vertex to the current subgraph to maximize the weight of the resulting new su bgraph; this 



proces s is repeated until k vertices are chosen. The other greedy method is developed by lFeige et al 



(|200ll . Procedure 2) which is referred as Greedy-Feige in our experiments. The procedure works as 
follows: let S denote the k/2 vertices with the highest degrees in G; let C denote the k/2 vertices 
in the remaining vertices with largest number of neighbors in S; return S L) C. 

Figure 14.21 shows the density value iT~^WTT/k and CPU time versus the cardinality k. From 
the density curves we can observe that on cnr-2000, ljournal-2008 and hollywood-2009, TPower- 
DkS consistently outputs denser subgraphs than the two greedy algorithms, while on amazon-2008, 
TPower-DkS and Greedy-Ravi are comparable and both are better than Greedy-Feige. For CPU 



^Note that the inequaUty ttJ{W + XIpxp)ivt > tvJ^i{W + \Ipy,p)itt-i is deemed to be satisfied when A is large 
enough, e.g., when W + XIpxp G S" 



^Datasets are available at http: //lae .dsi .unimi . it/datasets .phpj 
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running time, it can be seen from the right column of Figure 14.21 that Greedy- Feige is the fastest 
among the three methods while TPower-DkS is only slightly slower. This is due to the fact that 
TPower-DkS needs iterative matrix-vector products while Greedy-Feige only needs a few degree 
sorting outputs. Although TPower-DkS is slightly slower than Greedy-Feige, it is still quite efficient. 
For example, on hollywood-2009 which has hundreds of millions of arcs, for each Greedy-Feige 
terminates within about 1 second while TPower terminates within about 10 seconds. The Greedy- 
Ravi method is however much slower than the other two on all the graphs when k is large. 



4.2.4 Results on Air- Travel Routine 

We have applied TPower-DkS to identify subsets of American and Canadian cities that are most 
easily connected to each other, in terms of estimated commercial airline travel time. The graph 
is of size \V\ = 456 and \E\ = 71,959: the vertices are 456 busiest commercial airports in United 
States and Canada, while the weight Wij of edge eij is set to the inverse of the mean time it takes 
to travel from city i to city j by airline, including estimated stopover delays. Due to the headwind 
effect, the transit time can depend on the direction of travel; thus 36% of the weight are asymmetric. 



Figure 3(a) shows a map of air-travel routine. 

As in the previous experiment, we compare TPower-DkS to Greedy-Ravi and Greedy-Feige on 
this dataset. For all the three algorithms, the densities of A;-subgraphs under different k values are 



shown in Figure [3(b)| and the CPU running time curves are given in Figure |3(c)[ From the former 



figure we observe that TPower-DkS consistently outperforms the other two greedy algorithms in 
terms of the density of the extracted fc-subgraphs. From the latter figure we can see that TPower- 
DkS is slightly slower than Greed-Feige but much faster than Greedy-Ravi. Figure |3(d)|3(e)[ and 



3(f)| illustrate the densest fc-subgraph with A; = 30 outputted by the three algorithms. In each of 
these three subgraph, the red dot indicates the representing city with the largest (weighted) degree. 
Both TPower-DkS and Greedy-Feige reveal 30 cities in east US. The former takes Cleveland as the 
representing city while the latter Cincinnati. Greedy-Ravi reveals 30 cities in west US and CA and 
takes Vancouver as the representing city. Visual inspection shows that the subgraph recovered by 
TPower-DkS is the densest among the three. 

After discovering the densest fc-subgraph, we can eliminate their nodes and edges from the 
graph and then apply the algorithms on the reduced graph to search for the next densest subgraph. 
Such a sequential procedure can be repeated to find multiple densest A;-subgraphs. Figure 3(g) |3(h)" 



and 3(i) illustrate sequentially estimated six densest 30-subgraphs by the three algorithms. Again, 
visual inspection shows that our method output more geographically compact subsets of cities than 
the other two. As a quantitative result, the total density of the six subgraphs discovered by the 
three algorithms is: 1.14 (TPower-DkS), 0.90 (Greedy-Feige) and 0.99 (Greedy-Ravi), respectively. 



5 Conclusion and Future Work 

The sparse eigenvalue problem has been widely studied in machine learning with applications such 
as sparse PGA. TPower is a truncated power iteration method that approximately solves the non- 
convex sparse eigenvalue problem. Our analysis shows that when the underlying matrix has sparse 
eigenvectors, under proper conditions TPower can approximately recover the true sparse solution. 
The theoretical benefit of this method is that with appropriate initialization, the reconstruction 

^The data is available at [www.psi . toronto . edu/af f initypropogatioii| 
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Figure 4.2: Identifying densest /c-subgraph on four web graphs. Left: density curves as a function 
of k. Right: CPU time curves as a function of k. 
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Densest k-Subgraph Densest k-Subgraph 
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(d) TPower-DkS (e) Greedy-Ravi (f) Greedy-Feige 




(g) TPower-DkS (h) Greedy-Ravi (i) Greedy-Feige 

Figure 4.3: Identifying densest /c-subgraph of air-travel routing. Row 1: Route map, and the 
density and CPU time evolving curves. Row 2: The densest 30-subgraph discovered by the three 
algorithms. Row 3: Sequentially discovered six densest 30-subgraph by the three algorithms. 

quality depends on the restricted matrix perturbation error at size s that is comparable to the 
sparsity k, instead of the full matrix dimension p. This explains why this method has good empir- 
ical performance. To our knowledge, this is the first theoretical result of this kind, although our 
empirical study suggests that it might be possible to prove related sparse recovery results for some 
other algorithms we have tested. 

We have applied TPower to two concrete applications: sparse PCA and the densest fc-subgraph 
finding problem. Extensive experimental results on synthetic and real-world datasets validate the 
effectiveness and efficiency of the TPower algorithm. 
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Appendix 

A Proof of Theorem [1] 

We state the following standard resul t from the perturba t ion th eory of symmetric eigenvalue prob- 
lem. It can be found for example in (jGolub &: Van Loan . ll99fil V 



Lemma 1. If B and B + U are p x p symmetric matrices, then VI < A; < 

Xk{B) + Xp{U) < Xk{B + U)< \k{B) + Ai(C/), 
where Xk{B) denotes the k-th largest eigenvalue of matrix B. 

Lemma 2. Consider set F such that supp{x) C F with \F\ = s. If p{E,s) < AA/2, then the ratio 
of the second largest (in absolute value) to the largest eigenvalue of sub matrix Ap is no more than 
7(5). Moreover, 

-T ,z.M, V2p{E,s) 



|x' -x{F)\\ < 6{s) :-- 



^p{E,sY + {^\-2p{E,s)Y 
Proof. We may use Lemma [1] with B = Ap and U = Ep to obtain 

\i{Ap) > Xi{Ap) + Xp{Ep) > \i{Ap)-p{Ep) >\-p{E,s) 

and Vj > 2, 

\\j{Ap)\ < \Xj{Af)\ + p{Ep) < A - AA + p{E, s). 

This implies the first statement of the lemma. 

Now let x{F), the largest eigenvector of Ap, be ax + Px', where ||x||2 = ||3;'||2 = !> x~^x' = 
and + = 1, with eigenvalue A' > A — p{E, s). This implies that 

aApx + /3Apx' = X! [ax + fix'), 

implying 

ax''^Apx + px'^Apx = X'f3. 

That is, 

x'~^ Apx \x'~^ Apx\ \x'~^ Epx\ 

" '"'a'-x'T^^x' - '"'a'-x'T^^x' " '"'a'-x'T^^x' - 

where t = p{E,s)/{AX - 2p{E,s)). This implies that 0^(1 + t^) > q2 + /32 = 1^ and thus > 
1/(1 + Without loss of generality, we may assume that a > 0, because otherwise we can replace 
x with —X. It follows that 

- xf = 2 - 2x(F)^x = 2 - 2q < < 2^ 



\/rn^ - i+t2- 

This implies the desired bound. □ 
The following result measures the progress of untruncated power method. 
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Lemma 3. Let y be the eigenvector with the largest (in absolute value) eigenvalue of a symmetric 
matrix A, and /et 7 < 1 be the ratio of the second largest to largest eigenvalue in absolute values. 
Given any x such that \\x\\ = 1 and y^ x > 0; let x' = Ax/\\Ax\\, then 

> + (1 - 7')(1 - (yTa;)2)/2]. 

Proof. Without loss of generality, we may assume that Xi{A) = 1 is the largest eigenvalue in 
absolute value, and |Aj(74)| < 7 when j > I. We can decompose x as x = ay + Py' , where y~^y' = 0, 
l|y|| = lly'll = 1' s-^d + = 1. Then \a\ = \x'^y\. Let z' = Ay', then ||z'|| < 7 and y~^ z' = 0. 
This means Ax = ay + f3z' , and 

J |a| ^ |a| 



\y ' x'\ = '".. ^ ' = ' ' — > 



\\Ax\\ v/a2 + ^2||_2/||2 - + ^2^2 



Vl-(l-72)(l-(yW) 

>\y^x\ [l + (l-72)(l-(yT^)2)/2]. 



The last inequality is due to 1/ — z > 1 + z/2 for z G [0, 1). This proves the desired bound. □ 

Lemma 4. Consider x with supp{x) = F and k = \F\. Consider y and let F = supp{y,k) be the 
indices of y with the largest k absolute values. = ||y|| = 1, then 

\Truncateiy,F)'^x\ > |y^x| - {k/k)''/^mm [l, (1 + (k/kf/^) (1 - (y^x)^) 



Proof. Without loss of generality, we assume that y^x = A > 0. We can also assume that 
A > y^k/{k + k) because otherwise the right hand side is smaller than zero, and thus the result 
holds trivially. 

Let Fi = F\F, and F2 = FnF, and F3 = F \ F. Now, let a = \\xfi\\, /3 = llx^aH, a = \\yFi\\, 
/3 = II2/F2II, and 7 = Hi/FsH- let ki = \Fi\, k2 = |i*2|, and k^ = {F^]. It follows that o? jkx < jkj,. 
Therefore 

A^ < \aa + /3/3]2 < + /S^ < 1 - 7^ < l _ {k^/ki)a^. 

This implies that 

< {ki/k3){l - A2) < {k/k){l - A^) < A^, (A.l) 

where the second inequality follows from k < k and the last inequality follows from the assumption 
A > \/k/(k + k). Now by solving the following inequality for a: 



aa 



+ vT 



a 



> aa + (3/3 > A > a > aa, 



we obtain that 
a<aA + \f\ 



a2vr 



^2 < 



mm 



l,a + \/l-A2 



< min 



l,(l + (^/A:)i/2)v/l_A2 , (A.2) 



where the second inequality follows from the Cauchy-Schwartz inequality and A < 1, \/\ — o? < 1, 
while the last inequality follows from (lA.ip . Finally, 

|7/'''x| — |Truncate(?/, -F)~''x| < | (y — Truncate(?/, F))'''x| 

<aa< {k/kf/^ min [l, (1 + (k/kf/^) (1 - {y'^ xf)] , 

where the last inequality follows from (lA.ip and (IA.2h . This leads to the desired bound. □ 
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Next is our main lemma, which says each step of sparse power method improves eigenvector 
estimation. 

Lemma 5. Let s = 2k + k. We have 

\xlx\ > {\xUx\ - 5{sm + (1 - 7(^)')(1 - {\xUx\ - 5{s)f)/2] - 5{s) - {k/kfl\ 
If\xJ_ix\ > 1/ y/?, + 5{s), then 

\Jl - \xjx\ < fi2\/ 1 - kfLi^l + V56{s). 

Proof. Let F = Ft-i U Ft U supp(x). Consider the fohowing vector 

x't = AFXt^i/WApXt-iW, (A.3) 

where Ap denotes the restriction of A on the rows and columns indexed by F. We note that 
replacing x[ with x[ in Algorithm [2] does not affect the output iteration sequence {xt} because of 
the sparsity of xt-i and the fact that the truncation operation is invariant to scaling. Therefore 
for notation simplicity, in the following proof we will simply assume that x[ is redefined 
according to (jA.Sp . 

Based on the above notation, we have 

\x'^x{F)\ > \xUx{F)\[l + {l-^{sf){l-{xUx{F)f)/2] 

> {\xUx\ - 6{s))[l + (1 - 7(^)')(1 - {\xUx\ - 6{s)f)/2], 

where the first inequality follows from Lemma [3l and the second is from Lemma [2] and |x^]^x| > 
\xJ_ix{F)\ — d{s), and the fact that x{l + (1 — 7^)(1 — x'^)/2) is increasing when x G [0, 1]. We can 
now use Lemma [2] again, and the preceding inequality implies that 

1x^x1 > i\xl,x\ - 6{s))[l + (1 - 7(^)')(1 - i\xUx\ - 6{s)f)/2] - 6{s). 

Next we can apply Lemma [4] to obtain 

\xjx\ > \x'Jx\ - {k/kf^ 

> {\xl,x\ - 5is))[l + (1 - jisf){l - {\xUx\ - 6{s)f)/2] - 6{s) - {k/kfl\ 

This leads to the first desired inequality. 

Next we will prove the second inequality. Without loss of generality and for simplicity, we may 
assume that x'^ x{F) > and xJ_iX > 0, because otherwise we can simply do appropriate sign 
changes in the proof. We obtain from Lemma [3] that 

> xlMF) [1 + (1 - 7(^)')(1 - {xUx{F) f)/2]. 

This implies that 

[1 - <[1 - xlMF)] [1 - (1 - 7(^)')(1 + x7_ix(F))(x7_ix(F))/2] 

<[l-xJ_MF)] [1- 0.45(1 -7(s)')], 
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where in the derivation of the second inequahty, we have used Lemma [2] and the assumption of the 
lemma that imphes xJ_ix{F) > xj_^x — 6{s) > l/\/3. We thus have 

\\x't - x{F)\\ < \\xt-i - x{F)\\ - 0.45(1 -7(s)2). 
Therefore using Lemma [21 we have 

< Wxt^i -x\\ -0.45(1 -7(s)2) + 2(5(s). 

This is equivalent to 

^Jl- \x'Jx\ < Y^l- |x7_iS|\/l -0.45(1 -7(s)2) + ^/2(5(s). 
Next we can apply Lemma [Hand use k/k < 0.25 to obtain 

^1 - \xjx\ < \J 1 - |x'/s| + ((fc//t)V2 + k/k){l - |x'/x|2) 

< ^l-|x'/x|-^l + 3(fc/fc)V2 

< /X2y''i - \xt-i^x\ + Vb5{s). 

This proves the second desired inequality. □ 
Proof of Theorem [T] 

We know if > n + 5{s), then Lemma [5] implies: 

|x7x| >{\xU^\ - 5(«))[1 + (1 - l{sf ){l - (|x7_ix| - 5(s))2)/2] - {6{s) + Ck/k)'/') 
>u[l + (1 - j{sf){l - u^)/2] - {6{s) + (k/kf^) >u + 5{s). 

The first inequality uses Lemma [5l the second inequality uses z{l + (1 — 7^)(1 — ^^)/2) is an 
increasing function of z S [0, 1]; and the third inequality uses the assumption of u in the theorem. 
This implies (by an easy induction argument) that we have \xj x\ > u + 5 for all t > 0. 

Now we can prove the theorem by induction. The bound clearly holds at t = 0. Assume it holds 
at some t — 1. If we have Ix^Li^l < l/\/3 + 5{s), then since z{l — z^) is increasing in [0, l/-v/3], and 
from Lemma [5] we have 

1x7^1 >{\^Ux\ - 5{sW + (1 - 7(«)')(1 - (k7-iS| - 5{s)f)/2 - {5{s) + {k/k)''') 
>\x1_^x\ - 5{s) + (1 - 7(s)2)u(l - u')/2 - {5{s) + {k/kf^) 

>\x1-lX\ + III. 

Combing this inequality with \xj x\ = \x~J x\/\\xt\\ > l^Jx] we get 

1 — 1x7^1 < 1 — \xj x\ < (1 — /Ui)[l — |x^]^x|], 

which implies the theorem at t. 

If I xj_ix\ > l/\/3 + S{s), then we have from Lemma[5] 

•^/l — \xj x\ < \Jl — \xj x\ < ^J'2\J^ — \x't-i^\ + \/5<5(s), 
which implies the theorem at t. This finishes induction. 
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